Home Depot Product Search Relevance

The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters.

GraphLab Create

This notebook uses the GraphLab Create machine learning library from an IPython notebook. You need a personal license to run this code.


In [1]:
import graphlab as gl
from nltk.stem import *

Load data from CSV files


In [2]:
train = gl.SFrame.read_csv("../data/train.csv")


[INFO] This non-commercial license of GraphLab Create is assigned to thomasv1000@hotmail.fr and will expire on October 12, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-34069 - Server binary: /Users/tjaskula/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1455056183.log
[INFO] GraphLab Server Version: 1.8.1
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.123565 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
PROGRESS: Parsing completed. Parsed 74067 lines in 0.1662 secs.

In [3]:
test = gl.SFrame.read_csv("../data/test.csv")


PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.210436 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
PROGRESS: Parsing completed. Parsed 166693 lines in 0.321425 secs.

In [4]:
desc = gl.SFrame.read_csv("../data/product_descriptions.csv")


PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.512102 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 61134 lines. Lines per second: 61129.8
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
PROGRESS: Parsing completed. Parsed 124428 lines in 1.5747 secs.

Data merging


In [5]:
# merge train with description
train = train.join(desc, on = 'product_uid', how = 'left')

In [6]:
# merge test with description
test = test.join(desc, on = 'product_uid', how = 'left')
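The two left joins above attach each product's description to every train and test row via `product_uid`. As a plain-Python sketch of what a left join does (toy data, not GraphLab's `SFrame.join`):

```python
# Toy illustration of a left join on 'product_uid' (stand-in data, not the real CSVs).
desc_by_uid = {100001: "angle description", 100002: "deck description"}

rows = [
    {"product_uid": 100001, "search_term": "angle bracket"},
    {"product_uid": 999999, "search_term": "unknown product"},
]

# Left join: every left-hand row is kept; rows with no match get None.
joined = [dict(r, product_description=desc_by_uid.get(r["product_uid"]))
          for r in rows]
```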

Let's explore some data

Let's examine three different query/product pairs:

  • the first in the training set
  • one from the middle of the training set
  • the last in the training set

In [7]:
first_doc = train[0]
first_doc


Out[7]:
{'id': 2,
 'product_description': 'Not only do angles make joints stronger, they also provide more consistent, straight corners. Simpson Strong-Tie offers a wide variety of angles in various sizes and thicknesses to handle light-duty jobs or projects where a structural connection is needed. Some can be bent (skewed) to match the project. For outdoor projects or those where moisture is present, use our ZMAX zinc-coated connectors, which provide extra resistance against corrosion (look for a "Z" at the end of the model number).Versatile connector for various 90 connections and home repair projectsStronger than angled nailing or screw fastening aloneHelp ensure joints are consistently straight and strongDimensions: 3 in. x 3 in. x 1-1/2 in.Made from 12-Gauge steelGalvanized for extra corrosion resistanceInstall with 10d common nails or #9 x 1-1/2 in. Strong-Drive SD screws',
 'product_title': 'Simpson Strong-Tie 12-Gauge Angle',
 'product_uid': 100001,
 'relevance': 3.0,
 'search_term': 'angle bracket'}

The search term 'angle bracket' is not contained in the description. After stemming, 'angle' would match (via 'angles'), but 'bracket' still would not.


In [8]:
middle_doc = train[37033]
middle_doc


Out[8]:
{'id': 113228,
 'product_description': 'PureBond Plywood Project Panels are a convenient and cost-effective way to build cabinets, furniture and other woodworking projects. It provides a beautiful wood veneer face bonded to a strong and flat wood core. These PureBond Project Panels are made with no added formaldehyde, eliminating the concern about off-gassing dangerous fumes during fabrication or when installed in your home. Their smaller size makes them easy to handle and allows you to order just the amount of wood you need. PureBond plywood, in Project Panels sizes or in full sheet sizes, are a Home Depot exclusive.California residents: see Proposition 65 informationDecorative mahogany veneer applied to both sides of this panelB-2 plain sliced mahogany - 7-ply constructionLight weight, all-wood veneer constructionPrecision-cut hardwood plywood panels in convenient small sizesCommon: 3/4 in. x 2 ft. x 4 ft.; Actual: 0.703 in. x 24 in. x 48 in.Grade: B-2',
 'product_title': '3/4 in. x 2 ft. x 4 ft. PureBond Mahogany Plywood Project Panel',
 'product_uid': 137334,
 'relevance': 3.0,
 'search_term': 'table top wood'}

Only 'wood' from the search term is present.


In [9]:
last_doc = train[-1]
last_doc


Out[9]:
{'id': 221473,
 'product_description': 'No. 918 Millennial Ryan heathered texture semi-sheer curtain is a casual solid that adds freshness and a finishing touch to any decor setting. Enhances privacy while allowing light to gently filter through. Clean, simple one-pocket pole top design can be used with a standard or decorative curtain rod. Mix and match with other solids and prints for a look that is all your own.Sheer panel, gently filters lightNo header pole top panelMachine washableWide array of colors to choose from100% polyesterContains 1-curtain panel',
 'product_title': 'LICHTENBERG Pool Blue No. 918 Millennial Ryan Heathered Texture Sheer Curtain Panel, 40 in. W x 63 in. L',
 'product_uid': 206650,
 'relevance': 2.33,
 'search_term': 'fine sheer curtain 63 inches'}

Only 'sheer' and 'curtain' from the search term are present.

How many search terms are missing from the description and title of relevance-3 documents?

Relevance-3 documents are the most relevant matches, but how many of these queries have no search terms in the product description or title?


In [10]:
train['search_term_word_count'] = gl.text_analytics.count_words(train['search_term'])
ranked3doc = train[train['relevance'] == 3]
print ranked3doc.head()
len(ranked3doc)


+-----+-------------+-------------------------------+
|  id | product_uid |         product_title         |
+-----+-------------+-------------------------------+
|  2  |    100001   | Simpson Strong-Tie 12-Gaug... |
|  9  |    100002   | BEHR Premium Textured Deck... |
|  18 |    100006   | Whirlpool 1.9 cu. ft. Over... |
|  21 |    100006   | Whirlpool 1.9 cu. ft. Over... |
|  27 |    100009   | House of Fara 3/4 in. x 3 ... |
|  35 |    100011   | Toro Personal Pace Recycle... |
|  37 |    100011   | Toro Personal Pace Recycle... |
|  65 |    100016   | Sunjoy Calais 8 ft. x 5 ft... |
| 123 |    100023   | Quikrete 80 lb. Crack-Resi... |
| 162 |    100029   | DecoArt Americana Decor 16... |
+-----+-------------+-------------------------------+
+--------------------------------+-----------+-------------------------------+
|          search_term           | relevance |      product_description      |
+--------------------------------+-----------+-------------------------------+
|         angle bracket          |    3.0    | Not only do angles make jo... |
|           deck over            |    3.0    | BEHR Premium Textured DECK... |
|         convection otr         |    3.0    | Achieving delicious result... |
|           microwaves           |    3.0    | Achieving delicious result... |
|            mdf 3/4             |    3.0    | Get the House of Fara 3/4 ... |
| briggs and stratton lawn mower |    3.0    | Recycler 22 in. Personal P... |
|            gas mowe            |    3.0    | Recycler 22 in. Personal P... |
|          grill gazebo          |    3.0    | Make grilling great with t... |
| CONCRETE & MASONRY CLEANER...  |    3.0    | Quikrete 80 lb. Crack-Resi... |
|          chalk paint           |    3.0    | Achieving a vintage, time-... |
+--------------------------------+-----------+-------------------------------+
+-------------------------------+
|     search_term_word_count    |
+-------------------------------+
|   {'bracket': 1, 'angle': 1}  |
|     {'over': 1, 'deck': 1}    |
|  {'otr': 1, 'convection': 1}  |
|       {'microwaves': 1}       |
|      {'mdf': 1, '3/4': 1}     |
| {'and': 1, 'stratton': 1, ... |
|     {'gas': 1, 'mowe': 1}     |
|   {'grill': 1, 'gazebo': 1}   |
| {'etcher': 1, 'cleaner': 1... |
|    {'chalk': 1, 'paint': 1}   |
+-------------------------------+
[10 rows x 7 columns]

Out[10]:
19125

In [11]:
words_search = gl.text_analytics.tokenize(ranked3doc['search_term'], to_lower = True)
words_description = gl.text_analytics.tokenize(ranked3doc['product_description'], to_lower = True)
words_title = gl.text_analytics.tokenize(ranked3doc['product_title'], to_lower = True)
wordsdiff_desc = []
wordsdiff_title = []
puid = []
search_term = []
ws_count = []
ws_count_used_desc = []
ws_count_used_title = []
for item in xrange(len(ranked3doc)):
    ws = words_search[item]
    pd = words_description[item]
    pt = words_title[item]
    # a set difference is never None; an empty set simply means every
    # query token appears in the document
    diff = set(ws) - set(pd)
    wordsdiff_desc.append(diff)
    
    diff2 = set(ws) - set(pt)
    wordsdiff_title.append(diff2)
    
    puid.append(ranked3doc[item]['product_uid'])
    search_term.append(ranked3doc[item]['search_term'])
    ws_count.append(len(ws))
    ws_count_used_desc.append(len(ws) - len(diff))
    ws_count_used_title.append(len(ws) - len(diff2))
    
differences = gl.SFrame({"puid" : puid,
                         "search term": search_term,
                         "diff desc" : wordsdiff_desc,
                         "diff title" : wordsdiff_title,
                         "ws count" : ws_count, 
                         "ws count used desc" : ws_count_used_desc,
                         "ws count used title" : ws_count_used_title})

In [12]:
differences.sort(['ws count used desc', 'ws count used title'])


Out[12]:
+---------------------------+---------------------------+--------+-----------------------+----------+--------------------+---------------------+
|         diff desc         |         diff title        |  puid  |      search term      | ws count | ws count used desc | ws count used title |
+---------------------------+---------------------------+--------+-----------------------+----------+--------------------+---------------------+
| [recycling, bins]         | [recycling, bins]         | 145727 | recycling bins        | 2        | 0                  | 0                   |
| [over, deck]              | [over, deck]              | 100002 | deck over             | 2        | 0                  | 0                   |
| [hammer, electric, drill] | [hammer, electric, drill] | 120061 | electric hammer drill | 3        | 0                  | 0                   |
| [microwaves]              | [microwaves]              | 100006 | microwaves            | 1        | 0                  | 0                   |
| [plywoods]                | [plywoods]                | 119996 | plywoods              | 1        | 0                  | 0                   |
| [coca, cola]              | [coca, cola]              | 120276 | coca cola             | 2        | 0                  | 0                   |
| [greenhouses]             | [greenhouses]             | 120318 | greenhouses           | 1        | 0                  | 0                   |
| [pipe, cutters]           | [pipe, cutters]           | 119840 | pipe cutters          | 2        | 0                  | 0                   |
| [buit, themostat, in]     | [buit, themostat, in]     | 206359 | buit in themostat     | 3        | 0                  | 0                   |
| [mowers, ridding]         | [mowers, ridding]         | 120366 | ridding mowers        | 2        | 0                  | 0                   |
+---------------------------+---------------------------+--------+-----------------------+----------+--------------------+---------------------+
[19125 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [13]:
print "No terms used in description : " + str(len(differences[differences['ws count used desc'] == 0]))
print "No terms used in title : " + str(len(differences[differences['ws count used title'] == 0]))
print "No terms used in description and title : " + str(len(differences[(differences['ws count used desc'] == 0) & 
                                                                        (differences['ws count used title'] == 0)]))


No terms used in description : 2666
No terms used in title : 2152
No terms used in description and title : 1206
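The same overlap computation can be sketched without GraphLab: tokenize both strings, take the set difference, and count how many query tokens survive. A hypothetical mini-example (not the real data pipeline):

```python
def unused_terms(search_term, document):
    # Query tokens that never appear verbatim in the document.
    ws = set(search_term.lower().split())
    doc = set(document.lower().split())
    return ws - doc

# Without stemming, 'angle' does not match 'angles', so neither query token is found.
diff = unused_terms("angle bracket", "angles make joints stronger")
used = 2 - len(diff)  # query tokens actually present in the document
```

This is exactly why stemming (below) helps: it collapses 'angles' and 'angle' to the same token before the comparison.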

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline

Stemming


In [29]:
#stemmer = SnowballStemmer("english")
stemmer = PorterStemmer()
def stem(text):
    # Decode to unicode (replacing undecodable bytes), then stem each whitespace token.
    stemmed_tokens = [stemmer.stem(token) for token in unicode(text, errors='replace').split()]
    return ' '.join(stemmed_tokens)
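A Python 3 rendition of the same helper, with a toy suffix-stripper standing in for NLTK's PorterStemmer (only Porter's step 1a; the real algorithm has many more rules):

```python
def toy_stem(token):
    # Crude stand-in for a real stemmer: Porter's plural rules only.
    if token.endswith("sses"):
        return token[:-2]   # "glasses" -> "glass"
    if token.endswith("ies"):
        return token[:-2]   # "ponies"  -> "poni"
    if token.endswith("ss"):
        return token        # "caress" unchanged
    if token.endswith("s"):
        return token[:-1]   # "angles"  -> "angle"
    return token

def stem_text(text):
    # Lowercase, split on whitespace, stem each token, rejoin.
    return " ".join(toy_stem(t) for t in text.lower().split())
```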

In [30]:
print "Starting stemming train search term..."
stemmed = train['search_term'].apply(stem)
train['stem_search_term'] = stemmed

print "Starting stemming train product description..."
stemmed = train['product_description'].apply(stem)
train['stem_product_description'] = stemmed

print "Starting stemming train product title..."
stemmed = train['product_title'].apply(stem)
train['stem_product_title'] = stemmed

print "Starting stemming test search term..."
stemmed = test['search_term'].apply(stem)
test['stem_search_term'] = stemmed

print "Starting stemming test product description..."
stemmed = test['product_description'].apply(stem)
test['stem_product_description'] = stemmed

print "Starting stemming test product title..."
stemmed = test['product_title'].apply(stem)
test['stem_product_title'] = stemmed


Starting stemming train search term...
Starting stemming train product description...
Starting stemming train product title...
Starting stemming test search term...
Starting stemming test product description...
Starting stemming test product title...

TF-IDF with linear regression


In [32]:
train['search_term_word_count'] = gl.text_analytics.count_words(train['stem_search_term'])
train_search_tfidf = gl.text_analytics.tf_idf(train['search_term_word_count'])

In [33]:
train['search_tfidf'] = train_search_tfidf

In [34]:
train['product_desc_word_count'] = gl.text_analytics.count_words(train['stem_product_description'])
train_desc_tfidf = gl.text_analytics.tf_idf(train['product_desc_word_count'])

In [35]:
train['desc_tfidf'] = train_desc_tfidf

In [36]:
train['product_title_word_count'] = gl.text_analytics.count_words(train['stem_product_title'])
train_title_tfidf = gl.text_analytics.tf_idf(train['product_title_word_count'])
train['title_tfidf'] = train_title_tfidf

In [48]:
train['distance_desc'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['desc_tfidf']))
#train['distance_desc_sqrt'] = train['distance_desc'] ** 2
train['distance_title'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['title_tfidf']))
#train['distance_title_sqrt'] = train['distance_title'] ** 3
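The TF-IDF weighting and cosine distance used above can be sketched in plain Python. Assumed formulas (GraphLab's exact variant may differ): tf = raw count, idf = log(N / df), and cosine distance = 1 − cosine similarity:

```python
import math

# Toy word-count dictionaries: a query and two "documents".
docs = [
    {"angle": 1, "bracket": 1},   # query
    {"angle": 2, "joint": 1},     # matching description
    {"deck": 1, "over": 1},       # unrelated description
]

# Document frequency of each word across the corpus.
N = len(docs)
df = {}
for d in docs:
    for w in d:
        df[w] = df.get(w, 0) + 1

def tf_idf(counts):
    # Weight each word by count * log(N / document frequency).
    return {w: c * math.log(N / df[w]) for w, c in counts.items()}

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

q = tf_idf(docs[0])
d_match = cosine_distance(q, tf_idf(docs[1]))      # shares 'angle': distance < 1
d_unrelated = cosine_distance(q, tf_idf(docs[2]))  # no shared words: distance = 1
```

Smaller distance means the query and document share more (rarer) vocabulary, which is the signal fed to the regression below.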

In [50]:
model1 = gl.linear_regression.create(train, target = 'relevance', 
                                         features = ['distance_desc', 'distance_title'], 
                                         validation_set = None)
# model1 = gl.linear_regression.create(train, target = 'relevance', 
#                                         features = ['distance_desc', 'distance_desc_sqrt', 'distance_title', 'distance_title_sqrt'], 
#                                         validation_set = None)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 74067
PROGRESS: Number of features          : 2
PROGRESS: Number of unpacked features : 2
PROGRESS: Number of coefficients    : 3
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.054827     | 1.934252           | 0.502806      |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [51]:
#let's take a look at the weights before we plot
model1.get("coefficients")


Out[51]:
+----------------+-------+-----------------+-----------------+
|      name      | index |      value      |      stderr     |
+----------------+-------+-----------------+-----------------+
| (intercept)    |  None |  3.36098120617  | 0.0130126162735 |
| distance_desc  |  None | -0.471115683671 | 0.0175727545026 |
| distance_title |  None | -0.792854603251 | 0.0120369738892 |
+----------------+-------+-----------------+-----------------+
[3 rows x 4 columns]
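Both weights are negative: the larger the cosine distance (i.e. the less the query resembles the description or title), the lower the predicted relevance. A prediction is just the linear form intercept + w_desc·distance_desc + w_title·distance_title, sketched here with the fitted coefficients above:

```python
# Coefficients copied from the fitted model above.
INTERCEPT = 3.36098120617
W_DESC = -0.471115683671
W_TITLE = -0.792854603251

def predict(distance_desc, distance_title):
    return INTERCEPT + W_DESC * distance_desc + W_TITLE * distance_title

close = predict(0.2, 0.1)  # similar to both description and title -> higher relevance
far = predict(1.0, 1.0)    # orthogonal to both (no shared terms) -> lower relevance
```

Note that `far` reproduces the value 2.0970... that recurs in the predictions below: those are queries sharing no stemmed terms with either field.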

In [53]:
test['search_term_word_count'] = gl.text_analytics.count_words(test['stem_search_term'])
test_search_tfidf = gl.text_analytics.tf_idf(test['search_term_word_count'])
test['search_tfidf'] = test_search_tfidf
test['product_desc_word_count'] = gl.text_analytics.count_words(test['stem_product_description'])
test_desc_tfidf = gl.text_analytics.tf_idf(test['product_desc_word_count'])
test['desc_tfidf'] = test_desc_tfidf
test['product_title_word_count'] = gl.text_analytics.count_words(test['stem_product_title'])
test_title_tfidf = gl.text_analytics.tf_idf(test['product_title_word_count'])
test['title_tfidf'] = test_title_tfidf

test['distance_desc'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['desc_tfidf']))
#test['distance_desc_sqrt'] = test['distance_desc'] ** 2
test['distance_title'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['title_tfidf']))
#test['distance_title_sqrt'] = test['distance_title'] ** 3

In [54]:
# Disabled: the test set has no 'relevance' column, so RSS can't be computed on it.
'''
predictions_test = model1.predict(test)
test_errors = predictions_test - test['relevance']
RSS_test = sum(test_errors * test_errors)
print RSS_test
'''


Out[54]:
"\npredictions_test = model1.predict(test)\ntest_errors = predictions_test - test['relevance']\nRSS_test = sum(test_errors * test_errors)\nprint RSS_test\n"

In [55]:
predictions_test = model1.predict(test)
predictions_test


Out[55]:
dtype: float
Rows: 166693
[2.1194586905641986, 2.097010919250315, 2.327318656459769, 2.3423416569379105, 2.291363750454904, 2.1292410028129387, 2.3700891659576886, 2.380158719246286, 2.1461522961232458, 2.6725514683935634, 2.4735444612741126, 2.3577916980297187, 2.527370985769263, 2.5464611241149497, 2.2433784781304618, 2.3610473142083634, 2.097010919250315, 2.615135319749975, 2.1428328384749085, 2.1832873644759365, 2.7538729574608336, 2.7335476206465064, 2.1493438560051157, 2.3267645430460764, 2.2512091489393167, 2.503199755290125, 2.097010919250315, 2.2981432458210644, 2.3803635622873522, 2.322596343031413, 2.519767096157715, 2.362660486577712, 2.1974497974380167, 2.309948689847278, 2.313598821940017, 2.341687147636327, 2.4205333515242975, 2.3366063448390766, 2.8853671419333744, 2.8633757709368384, 2.2552865952704444, 2.297532949563152, 2.165997301067405, 2.097010919250315, 2.4552670270468595, 2.3625494876731397, 2.5106462498135387, 2.6188007396757573, 2.61900376832135, 2.2454680169370245, 2.1036340833149754, 2.102527092843549, 2.122092869812211, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.103358463836085, 2.097010919250315, 2.097010919250315, 2.1267995136846127, 2.1269757879601405, 2.097010919250315, 2.4744936795651946, 2.4089294758879447, 2.360521564894908, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.097010919250315, 2.224736363810293, 2.097010919250315, 2.557408344984125, 2.243072745659523, 2.097010919250315, 2.097010919250315, 2.5501178083366147, 2.097010919250315, 2.480335363483051, 2.283461192815867, 2.283461192815867, 2.1872823298107584, 2.1683105172043, 2.3396260222904397, 2.387438457989694, 2.505582843823445, 2.418305094666935, 2.5793821608777128, 2.3349132621118036, 2.097010919250315, 2.5821088945080928, 2.3188142743790774, 2.270976568556923, 2.242832290300672, 2.4621407613916673, 2.338149998696164, 2.3557175413828784, ... ]

In [56]:
submission = gl.SFrame(test['id'])

In [57]:
submission.add_column(predictions_test)
submission.rename({'X1': 'id', 'X2':'relevance'})


Out[57]:
+----+---------------+
| id |   relevance   |
+----+---------------+
| 1  | 2.11945869056 |
| 4  | 2.09701091925 |
| 5  | 2.32731865646 |
| 6  | 2.34234165694 |
| 7  | 2.29136375045 |
| 8  | 2.12924100281 |
| 10 | 2.37008916596 |
| 11 | 2.38015871925 |
| 12 | 2.14615229612 |
| 13 | 2.67255146839 |
+----+---------------+
[166693 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [58]:
submission['relevance'] = submission.apply(lambda x: 3.0 if x['relevance'] > 3.0 else x['relevance'])
submission['relevance'] = submission.apply(lambda x: 1.0 if x['relevance'] < 1.0 else x['relevance'])
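The two passes above clamp predictions into the label range [1.0, 3.0], since the crowdsourced relevance scores can never fall outside it. The same clamp in one step:

```python
def clip_relevance(r, low=1.0, high=3.0):
    # Scores outside the label range are clamped to its endpoints.
    return max(low, min(high, r))
```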

In [59]:
submission['relevance'] = submission.apply(lambda x: str(x['relevance']))

In [60]:
submission.export_csv('../data/submission2.csv', quote_level = 3)

In [ ]:
#gl.canvas.set_target('ipynb')